(54) Method for evaluating states of biological systems



(57) A method for evaluating states of biological systems
comprising the steps of

a) constructing a pathway comprising at least two 
molecules and their interaction network, 

b) measuring expression data with an appropriate 
experiment and measuring device and 

c) calculating a score for said pathway based on 
said experimental quantification of the amounts of 
molecules in said system, said score indicating an 
intensity of realization of said pathway in said state 
of said biological system. 
Description

[0001] The present invention relates to a method for evaluating states of biological systems.
Background of the Invention

Measurements that quantify abundances of molecules in biological systems, such as genes (e.g. via DNA chips or micro-arrays)
or proteins (e.g. via mass spectrometry), can now be performed by several established techniques, including

• EST sequencing, clustering and counting [26, 27];

• micro-arrays [10, 14, 15, 23, 30, 33];

• DNA-chips [7, 22].

™-The^ 

on clustering [ 12 , 36], Coarse-gra^n function P Ved^ ^ 24 J *« often bui.ds 

cluster analysis [8], for a particular group of genes Sth unL^Zt T sem - aut °™tically as an extension of 

sri^ir usin9 supervi J ^-^c^crsr that can be » ■»-*■< — * 

clusters are used to restrict the sets of possib.e reactfonl FroJ th~ °- J" 9 * * e ex P ression «me series. The 

tematically as described in [17], simi.arto [20] The resu^s ^0 .iS of n ^ f 8613 PathWayS are c °"*™ted sys- 
does the method provjde a measure fQr LUaCbe^^ as is said in [13,. Neither 

the pathways generated are realized in the cells under investigation. When the method is applied starting from some
expression level based clustering of the genes, the result contains some (but not all) similarly regulated subsets of genes.
This has several disadvantages: the subsets in general do not correspond to complete pathways, a reasonable
null-model is not exploited to detect significance, and some pathways
that can be recognized as being realized from the expression data cannot be found, as will be shown in the worked
out example below.

Large Scale Measurements 

.nventay Of g.naa available , 0 an o,gaalsm in ™S 1 fjT' " """^ 9snomes « "<• 

simultaneously. Although the human genome has no, a " UW a " aChed ,ra 9 menl «™P*» 
not ye, been identified, DMA chip technology a.^aS^3™,rr ^ a ' th ° U9h a " human 9 enes "ave 
technology allows to fabricate chips with se^ratZn^ Z^ZTr T" " 96068 ° n one chi P- c " r ^nt 

to 140.000 human genes two to three-fold ^T, DN A ^^^u^lT^' Wh ' Ch C0uld cover the 1 °°<>00 
genes of the eucaryotic organism yeast V Wh ' Ch h °' d fra 9™nts from each of the 6000 
on such chips or arrays, which makes individual expression levels quite unreliable. With state-of-the-art techniques, a
two to three-fold increase or decrease in expression level is considered to be a real increase or decrease rather than 
a measuring error. One obvious possibility is to use these measurements to determine the subset of those genes that 
are truly differently expressed and, thus, are related to the phenotypical differences of the compared states. The
subsequent tasks are to exhibit why and how this subset could explain the causes and consequences of these differences.
In the near future these two questions have to be answered for many such experiments, as the experiments will be 
performed for large numbers of cell states, and the correct and fast answer to these questions for many experiments 
will be of direct scientific, pharmaceutical, and crucial commercial importance and value for companies striving to find
new innovative treatments for diseases. This implies that the evaluation of such experimental data has to be done in 
large parts with automated computer methods that have been validated and calibrated. 
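
The two to three-fold criterion described above can be illustrated with a minimal Python sketch. The function name `differentially_expressed`, the dictionary layout and the default threshold of 2.0 are assumptions made for illustration only; they are not taken from the disclosed method.

```python
import math

def differentially_expressed(state_a, state_b, min_fold=2.0):
    """Return genes whose expression differs by at least `min_fold`
    between two states, together with their log2 ratio (b relative to a)."""
    threshold = math.log2(min_fold)
    selected = {}
    for gene in state_a.keys() & state_b.keys():
        a, b = state_a[gene], state_b[gene]
        if a <= 0 or b <= 0:
            continue  # ratios are undefined for non-positive intensities
        log_ratio = math.log2(b / a)
        if abs(log_ratio) >= threshold:
            selected[gene] = log_ratio
    return selected

# Hypothetical intensities for three genes in two cell states.
normal = {"G1": 100.0, "G2": 50.0, "G3": 80.0}
diseased = {"G1": 310.0, "G2": 55.0, "G3": 20.0}
changed = differentially_expressed(normal, diseased)
```

Here G1 (about 3-fold up) and G3 (4-fold down) pass the filter, while G2 (1.1-fold) is treated as within measurement error.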

Metabolic Pathways and Petri Nets 

[0009] In particular time series measurements of systems, i.e. measuring the expression of a large set of - sometimes 
even all - expressed genes for a number of subsequent time intervals, allow to analyze the detailed interaction in known 
pathways as well as to infer new putative relations. Methods have been developed [20] which allow to represent
metabolic and regulatory networks with suitable graph-like structures, e.g. so called Petri nets [29, 32], and to enumerate
all possible pathways from the database of known chemical reactions performed by organisms. Pathways can be
confined to lead from some definable set of starting molecular units (the reactants) to another definable set of units
(the products). Valid pathways can be defined to account for additional biological knowledge and to exclude biologically 
impossible paths in order to substantially restrict the number of all possible paths in the interaction network. 
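
As a rough illustration of enumerating paths from reactants to products, the following Python sketch walks a toy reaction table by depth-first search. It is a deliberate simplification: it ignores the Petri net firing rule and the validity restrictions of [20], and the reaction names and the helper `enumerate_paths` are hypothetical.

```python
def enumerate_paths(reactions, start, target, max_len=10):
    """Enumerate reaction sequences that lead from `start` to `target`.

    `reactions` maps a reaction name to a (substrates, products) pair of sets.
    Each molecule is visited at most once per path, so the search terminates.
    """
    paths = []

    def extend(molecule, visited, path):
        if molecule == target:
            paths.append(list(path))
            return
        if len(path) >= max_len:
            return
        for name, (substrates, products) in sorted(reactions.items()):
            if molecule not in substrates:
                continue
            for product in sorted(products):
                if product not in visited:
                    path.append(name)
                    extend(product, visited | {product}, path)
                    path.pop()

    extend(start, {start}, [])
    return paths

# Hypothetical toy network: A -> B -> C -> D and a shortcut A -> C.
toy = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"A"}, {"C"}),
    "R4": ({"C"}, {"D"}),
}
paths = enumerate_paths(toy, "A", "D")
```

Restricting `max_len` or the set of admissible reactions plays the role, in this toy setting, of the biological constraints that reduce the number of possible paths.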

Summary of the invention 

[0010] The present invention discloses a method for evaluating states of biological systems comprising the steps of 

a) constructing a pathway comprising at least two molecules and their interaction network,

b) measuring expression data with an appropriate experiment and measuring device and



c) calculating a score for said pathway based on the experimental quantification of the amounts of molecules in 
said system, said score indicating an intensity of realization of said pathway in said state of said biological system. 

[0011] Pathways as used herein are structures that are suitable to describe relevant aspects of some molecules and 
their interactions. In a preferred embodiment, pathways are minimal substructures of complete representations of cells 
that still cover a biologically important process. Examples are closed pathways as described in [20].
[0012] The invention involves a procedure to evaluate biological and genomic data, specifically measurements of 
the quantitative abundances of sets of molecules in specific cell states. Here abundances are estimated or measured 
as concentrations or expression levels represented as numbers meaning either absolute counts or relative differences 
as compared to some reference state. 

[0013] Suitable molecules are e.g. substrates, small molecules, drug molecules, genes, DNA sequences, mRNA 
molecules, pre-proteins, or proteins. 

[0014] It is known that the comparison of such states, especially the expression levels of genes in such states, can 
yield important information on the differences of different cells on the molecular level. Of great importance is the com- 
parison of diseased and normal cells in order to exhibit or detect the causes and consequences of diseases with the 
final goal of finding possible target genes for drug treatment to remedy a disease or relieve its symptoms. Another 
important application is the investigation of the response of specific cells to treatment with potential drugs in order to 
assess their efficacy or toxicity. 

[0015] In a preferred embodiment, the method of the invention is followed by an estimation of the significance of said 
score of said pathway by the steps 

a) performing the method of claim 1 for at least one other pathway and 

b) comparing the score of said pathway to the score(s) of said other pathway(s). 



[0016] In one aspect of the invention, it is preferred that the pathways have the same characteristics. A characteristic
of a pathway, as used herein, is defined as any quantitative property of the abstract interaction network described by 
the pathway. In a preferred embodiment, this property is of biological relevance or the scoring function is sensitive to 
[0022] Thd mnmnd of ,nd p,.«„, ,„„ n , on „ ,„„ epcffla „ y

• .« id.n,„ v p.^ tha , «, „ WosM) , rM|ijM ^ mBsing ^ a ^ ^ ^ 

Detailed description of the invention

Petri nets are well studied graph like oonS2Th£t» lnterac,,on networks in the from of Petri nets f11 29 321 
Petri nets are especially weLSited ^Zn^ZTZZt T "I eXtenS ' Ve ' we «-tab.ished theo * 
forward way [31J. The available knowledge on molecl relaLnSin 9 ^ [ e,a,,onshi P s in a natural and straight- 
jn a uniform .anguage and, additional itfs a^^^S^^ * P-W n<3tS ' be ,0miulated 

or the investigation of biochemical pathways ^ess.ble to graph and simulation algorithms that are useful 

are indeed realized by a system under certain "II" ™ '° 9 ,ca,s y stems - Whether or not these pathways 

on reactions and interactions exploited for the SK^SS^S^^ 1 be deriVed fram ™ database's 
are increasingly becoming available by the above mwESES) al0ne - ™ e data needed t0 address these questions 
[0025] ln one embod,ment, this l»JLn^SS?S^^ measurements on large sets of genes 
scoring device and which al.ows to find reafe^ eXpr6SSi0n data via a «a. 

allows for an evaluation of the raw expression datT eTsed Zhf measurement s and ' at the sa ™ time, 

complete pathways, the procedure aHows for uJS^^Z^JST^^ ^ 6XpreSSi0n va,ues f ° 
be worthwhile starting points for new pathways tote^Z^Z^T* *' Pa ' hWa » Which could 
tigated (e.g. cell types, diseases, etc.) can easily be phased into the process by globally specifying the validated or
hypothesized knowledge as a user defined interaction network.

[0027] Petri nets allow for the manual modification of generated or previously constructed networks in order to aug- 
ment the network by specific individual knowledge of experts. This directly enables experts to add additional facts, to 
formulate hypotheses, and to specify contradicting alternatives. This information, in the form of an extended network, 
can be used to evaluate a whole range of experiments as described above. The different hypotheses or alternatives 
can be evaluated and thus allow to identify possible targets for interrupting or stimulating specific pathways. 
[0028] The method of the present invention explicitly allows for feeding hypotheses or biological intuition or pharma- 
ceutical ideas on potential targets into a method for target finding. The system is able to evaluate the proposed hy- 
potheses together with the established knowledge against the new experimental evidence given by the expression 
level measurements. Alternative or contradicting hypotheses can be weighted against each other in the above context 
and, thus, the best alternative can be selected. Ideally, such a hypothesis is a complete pathway considered to be 
important for the biological system under investigation and providing hints for possible targets. Furthermore, an iterative 
process can be performed, which, based on previous hypotheses and the outcome of the corresponding expression 
experiments, allows to optimally design new experiments which further enhance the knowledge on the system and 
finally validates target candidates as far as is possible with this kind of experiments and analysis. 
[0029] Due to the error rates of the current expression level measurements, methods that rely on the comparison of 
individual gene products are bound to be very unreliable themselves. This is a major drawback of the useful evaluation 
of expression data. 

[0030] An important feature of the present invention is to combine the generation of possible and plausible pathways 
with the evaluation of expression data using a statistical score, which rates a complete or partial pathway with respect 
to the measured expression data. This score may be compared to the scores of all other possible pathways or to those
of random pathways. The score combines evidence from a complete set of measurements, each of which might be 
quite unreliable. Thereby, the score relies on many measurements and their relative difference to a large set of other 
measurements. Additionally, the score evaluates complete biological units as compared to individual reactions and 
units. 

[0031] There are many ways to phase information about the topology of the pathways and their semantics into the
calculation of the individual scores and the scores for complete pathways. For example, the expression levels should, 
in most cases, be the more correlated, the closer the respective gene products are on the path. Furthermore, if additional 
knowledge on the function of gene products is available or can be derived from the network, this could be indicative 
of whether a significant change in expression level should be expected or not. This can be taken into account via the 
specific design of the scoring function. 

[0032] Figure 1 discloses a scheme of the embodiments of the invention. 
[0033] One embodiment of the invention is disclosed below 

A: Pathway Generation as disclosed in further detail in [20] and [21] (incorporated by reference) 



A1 : Compile the available knowledge on biologically relevant reactions and interactions between gene products 
including hypotheses into a graph-like notation (Petri nets); 

A2: Compute all biologically possible paths which are of interest in a specific pharmaceutical or biological context; 



A3: Generate a set of random paths of similar characteristics (e.g. length, size and diameter). The result is a set 
of pathways, i.e. complete sets of genes. 

B: Scoring Pathways 



B1: Measure the expression levels with an appropriate experiment and measuring device and determine normalized
differences between the states to be compared for each gene product;



B2: Compute the score for any possible pathway with respect to the statistical model, the actual expression level 
differences and the topology of the respective pathway; 
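
Step A3 could be approximated, under strong simplifying assumptions, by drawing random gene sets of the same size as the pathway under investigation. The sketch below matches only the size, not length or diameter, and does not use the network topology; `random_gene_sets` and the ORF labels are hypothetical.

```python
import random

def random_gene_sets(all_genes, size, count, seed=0):
    """Draw `count` random gene sets of `size` genes each (without
    replacement within a set) from `all_genes`."""
    rng = random.Random(seed)   # fixed seed keeps the control sets reproducible
    genes = sorted(all_genes)   # sorted list gives a stable sampling population
    return [frozenset(rng.sample(genes, size)) for _ in range(count)]

# Hypothetical background of 50 genes; controls for a pathway of 10 genes.
background = [f"ORF{i:03d}" for i in range(50)]
controls = random_gene_sets(background, size=10, count=100)
```

Scoring these control sets with the same function as the pathway of interest yields the random score distribution against which the pathway score is compared.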

C: Evaluation of Expression Data

chams of reactions and interactions be^eTn tk ?' tOP ° ,09y °' the P a,h s P e *<ies very de ai ed 
s.and.ng the differences between statesTn the conte^ To thet ^t^'^ 6 ' ,h ° Se pathwavs ar * a good basis to u nder 
as ava.lable for the pathway generation o, stepsTi 3 meaSUrement ™* »• —nt biological knowledge 

Luuj/j This allows for: 

CI : Design „, ,„„„„, , wres « ton measu|sme „, s a ^ 
Figures 

T». p.thwy l0 , , palh oonlai „ s • d *2T * ™ T •""*• * S ' nk 4 =. respectively 

~ c,«, „ P „ h _ (lh . suta „ ,^L~srrxr: ™: £ 

a. cernpu.ed „„„ ,„, enun , era „ on « a ™» «n ol , ton, 0-s,„cos. ,o Pyn,™. 

; = yco,ys, 

at on algonthm described in [21 ]. The width has to be a, .ealt 2 fn ordTTT 2 85 C ° mputed with th * enumer 
oo^ r Va ' id P3thWays ' Note ,hat 'dividual pathwlys^annot be H i, T textb °° k -9lyco.ysis (thick .ines) 

[0044] Rgure 7: Differential Metabolic Disp.ay (OMD) of the atrl d ' St ; ngU,Shed in this ">™ °' illustration. ' 

genomes containing all pathways of width 2 for a I n»thl f glycolysis for yeast and MG (Mycoplasma Genitalium , 

This F.gure contains a„ enzymes of F g^ 

pathways present in both organisms (MG and yeast thin h^,,, , ' n,0rmation is av *i'a°'e The thick edges indicate 

No Pathways are present in MG anj not TyZZ ht .J^STST but 

°004 y6aSt ° r MG - ,ndiCale enZymeS known from «*er 

-p ; Jn ^Li^jTzzizi'z^rj^r ana ,he — - « - «— 
glycolysis pathway.

[0050] Figure 13: Histogram of the TCA cycle scores calculated similarly as in Figure 11.

[0051] Figure 14: Histogram of scores of random pathways of the same size as the TCA cycle pathway, analogous 
to Figure 12. 

[0052] Figure 15: Graphical illustration of the time series of expression levels of a TCA pathway (i.e. assignment of 
ORFs to the TCA cycle transitions). Each line corresponds to the expression of one of the genes involved in the example 
pathway during the seven time points. 

[0053] Table 1: Expression data as used for the score calculation of a pathway consisting of the ten genes shown.
The data are taken from [10]. There is one row of data for each gene included in the pathway as identified in the first
column. Each data column corresponds to a time point from t1 to t7. The data shown are the values r_{t,g} as defined in
Equation 1: the logarithms of the ratios of the measured gene expression at the indicated time point to the expression
at the base time point t0.

[0054] Table 2: Mean values for the simple pathway-models as described in Equation 2, calculated from the data 
shown in Table 1. In order to avoid the influence of self-correlation on the scores of genes included in the pathway,
each gene is removed from the pathway before the particular pathway-model is built that is used for the computation 
of the score of the respective gene. 

[0055] Table 3: Empirical standard deviations for pathway-models according to Equation 3 similar to Table 2. 
[0056] Table 4: Mean values and empirical standard deviations for the null-models. The parameters of a null-model
for each time point t1 to t7 are calculated from the data described in [10] using Equations 4 and 5.
[0057] Table 5: The scores for the genes included in the pathway for the different time points, computed according 
to Equation 8. The values in the last column, titled average, correspond to the gene scores according to Equation 9. 

Description of Methods 

Pathways 

[0058] In order to facilitate a system (e.g. cell-, tissue-, organism-, or species-) wide, holistic evaluation of sequence
and expression data, we compiled the available data of metabolic databases into Petri nets. Petri nets are graph-like
structures that lend themselves naturally to representing all kinds of relations and interconnections of distributed
interacting entities (substrates, proteins) in a metabolic/regulatory network. In the context of this invention, Petri nets
derived from available databases and additional expert input are used to provide the biological background knowledge
for the analysis of expression data, especially in order

• to merge all available databases to integrate the stored biochemical facts and to remove inconsistencies, 

• to generate (if desired, cell type specifically) all putative pathways that can be subjected to our new method to
evaluate pathways by expression data,

• to define and analyse interaction networks by their underlying structure of paths and pathways, 

• to compare genomic and expression information with knowledge about interaction networks and 

• to define a notion of Differential Metabolic Display (DMD) that allows to compare specific systems, i.e. organisms,
developmental or disease states, by comparison of the individual Petri nets.

[0059] The main sources of information about biochemical pathways are databases like BRENDA [19], ENZYME
[2], KEGG/LENZYME [25], MPW [34], WIT [28], EcoCyc [19], and HincCyc [18] containing textual descriptions of
reactions. Regulatory relations are inferred from sequence database annotations (i.e. Swissprot [3], Prosite [4]) or from
literature abstracts (Medline http://www.ncbi.nlm.nih.gov/entrez/).

[0060] The compilation process of the different databases used and the removal of mistakes and inconsistencies 
and the unification of the database format is described in detail in [21]. 

[0061] The main purpose of the compiled Petri nets for pathway databases is the systematic generation of paths 
and pathways in such nets to facilitate the analysis of differences between certain environmental states, between 
different organisms (genomes) and between different cell types of one organism. Petri nets with their underlying se- 
mantics [21] (the so called "firing rule") and additional user defined and biologically motivated restrictions [21] enable 
to drastically reduce the number of valid paths leading from a set of educts to a set of products. 
[0062] Based on such restricted valid paths, a concept of pathways is defined: Given a Petri net, a pathway associated
with a path is a partial net that contains the path and is closed. Closed paths account for the availability of educts and 

a^on of expression data ^Z^Ztio^^TX 9 ^ «" -P«* pathways, and to enable the evalu- 
10066? SET 8 r 6inS ° r ~ Ssir" fUnCti ° n l ° ™™ »» existencraX 
a uniform Petri net setup (Figure 4). Thus, the fJ^^STS ^ ? ^ ,0 make ,hem
Example
some 800 enzymes (not shown). Exploiting additionally f ? S (Sti,,) ab ° Ut 80 000 pa,hs inv °'™9

Pe "J^ST^ « and 7) , after having mapped the seguence data can 

seouf G) 9en ° meS 550 Pathwa ys with ZzfTsZ^TX^ ^ '° r ^ /eaSt and 

fnTT ' ° rmati ° n °' 1 85 °' the 225 Byrnes could be aiioned rL , T databases ' ° ut ° f 
l-nes are found only in yeast pathways but not in MG SZfE ° th or 9 a "isms (MG and yeast) thin black
• Putative pathways that could or could not be realized in these states.
• Answer the Questions: 

• Which pathways are in fact realized in the cells? 

• Which genes do not have much support by the current measurements to belong to the pathway ? 

• Which genes not included in the putative pathways are likely to be related to the pathways? 



• By producing the Output: 

• For each gene, both included and not included in the pathway, a score indicating how well it fits the putative pathway
according to the expression data.



• For the putative pathway as a whole, a statistical score how probable it is that the pathway is realized in the 
cell type under investigation. 

[0073] The basic idea is to rate the genes involved in a putative pathway, as well as the remaining genes, with respect
to this pathway according to the behaviour of the expression of all genes. In order to do so, two statistical models are
constructed, one model of the expression of the genes included in the pathway (pathway-model), and a second model
of the remaining genes (null-model or background-model). Each gene, whether included in the putative pathway or
not, can be compared to both models, and a score can be computed that reflects how much better it fits the
pathway-model than the null-model and vice versa. If this is desired, one of the models can be chosen to be uniform, i.e. to assign
equal probability to every gene, disregarding the observed expression behavior. This amounts to omitting the
corresponding model.

[0074] According to this idea, not only a given pathway itself can be rated, but in addition each gene from the pathway 
individually as well as each remaining gene can be rated with respect to expression correlation to the pathway. This 
offers the opportunity to augment the knowledge about the pathways by identifying, on one hand, not similarly expressed 
genes within the pathway and, on the other hand, similarly expressed genes that are not yet linked to the pathway.
This works the better, the stronger the gene expression experiments involve the regulation of the pathway under
investigation, and the better the models of expression behavior reflect the biological reality. Thus, the models should 
be calibrated using available measurement data. 

Definition of the Scoring System

[0075] Thus, for a concrete application of this principle to a putative pathway, crucial technical choices have to be 
made regarding two points: 

[0076] First, a number of gene expression assessment experiments have to be selected from the data sets already 
available, or have to be newly designed and performed. Additionally, relative weights can be assigned to the different
experiments or, in the case of time series measurements, even to individual time points. 

[0077] Second, a gene pathway scoring (GPS) function has to be defined, that assigns to each gene a score that 
reflects its correlation to the genes belonging to the pathway as opposed to the remaining genes. This is closely tied 
to the definition of the mathematical models for these two sets of genes. In the case of probabilistic models, the
log-odds ratio of the probabilities (as shown in the example below, Equation 8) is a natural choice for the scoring function.
A good scoring function has to reflect expression behavior resulting from the different types of possible biological
connections. With respect to expression time series, features like proportionality (common regulation), reciprocal
proportionality (synchronized regulation), time delayed correlation (one side regulates the other side) and the like can be
exploited in order to capture complex regulatory relations. On the side of the pathways, the graph structure of the
involved genes as given by the pathway can be taken into account, for example by giving more weight to the influence
of the correlation of genes that are close to each other in terms of the shortest path between them. However, in the
example below it is demonstrated that a comparatively simple function that corresponds to a simple pathway-model
already leads to reasonable results.
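
The time series features named above (proportionality, reciprocal proportionality, time delayed correlation) can be probed with a plain Pearson correlation computed at a lag. The following Python sketch is one possible way to do this, not the scoring function of the example below; `pearson`, `lagged_correlation` and the toy series are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def lagged_correlation(x, y, lag=0):
    """Correlate x[t] with y[t + lag]; a high value at lag > 0 suggests
    that y follows x with a delay of `lag` time points."""
    if lag < 0 or lag > len(x) - 2:
        raise ValueError("lag must leave at least two overlapping time points")
    return pearson(x[:len(x) - lag], y[lag:])

# Hypothetical log-ratio time series over seven time points.
regulator = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, 0.0]
target = [0.5, 0.0, 1.0, 2.0, 1.0, 0.0, -1.0]   # repeats the regulator one step later
inverse = [-v for v in regulator]                # reciprocal behaviour
```

In this toy data the delayed series correlates perfectly at lag 1 but only weakly at lag 0, and the inverted series gives a correlation of -1, which a scoring function could treat as evidence of a regulatory link rather than as a mismatch.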

[0078] Depending on the definition of the scoring function, the score distribution may depend on the characteristics 
of the pathways scored, most importantly the size. This hampers the comparison of scores of pathways of different
characteristics. Statistical scores, called p-values (probability estimates) or E-values (expectation values), that remedy 
analogous problems in the field of sequence comparison have been an important pre-requisite for the success of 
programs like BLAST and FASTA for that application. These scores, in addition to increasing the reliability of decisions 

Rating
[0079]



used to .dentify genes that are possibiy related to ,L pathway 96068 inC ' Uded ,he Pa,hWay can be 

~scores ^^p.^^..,.,..^^ 

-f=:^:~ 

Appropriate random scores can be computed by applying steps 1-3 of this procedure to random pathways sharing the characteristics (size,
length, width etc.) of the pathway under investigation.

subSuo the m e 6 x ZtZZT^ ** hyP ° theSiS that "» P^ * ™— in the type of ce.ls that were 



Example realization of the scoring function
[0080] 
http://cmgm.standford.edu/pbrownLplorSeThtmr by ° eRiSi * * f10) ,hat is

represented as real numbers (see tabte below o Te J™ ?°* n VeaSt 9ene 9 ,here ™ rneasuremen s / 

senes ,s optimally suited for the rating of the qTvco vsi ! nl ' eVe ' f ° r 8 S6t of different «™ points t This fime" 

2E? m6asured — - — ™ 

[0081] Investigation of the distribution of the relative r ha n , 



'03 ^, 



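each gene g. In the following, let r_{t,g} denote the log-relative expression levels

r_{t,g} = \log \frac{e_{t,g}}{e_{t_0,g}}    (1)

where e_{t,g} is the measured expression of gene g at time point t and t_0 is the base time point.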
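
A minimal Python sketch of Equation 1 follows. It assumes that the first entry of each series is the base time point and that natural logarithms are used; the base of the logarithm and the names `log_ratios` and `expression` are assumptions.

```python
import math

def log_ratios(series):
    """Return log(e_t / e_t0) for every time point after the base time point.

    `series[0]` is taken as the expression at the base time point t0.
    """
    base = series[0]
    return [math.log(value / base) for value in series[1:]]

# Hypothetical raw expression values at t0, t1, t2 for two genes.
expression = {"geneA": [100.0, 200.0, 50.0], "geneB": [10.0, 10.0, 40.0]}
r = {gene: log_ratios(values) for gene, values in expression.items()}
```

A doubling and a halving relative to t0 give values of equal magnitude and opposite sign, which is what makes the log scale convenient for the normal models used later.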
This leads to a distribution of values that is . •

distributions are not necessarily norn^ZTSSC ? f ? ^ d — ^^on. Whi.e the results 
s.gmo,d, andthe density functions are »hit^ElJ. d r* ,h ^ n ° rma ' dist ^tions. TheS 
by normal distributions without making a qualitative e m^^ 

the models used in the scoring function. " ' S take " advanta 9 e <* this observation in order to constrS 
[0082] For each time point, two sets of expression values are collected, corresponding to the set of genes involved
in the pathway (denoted p) and the set of remaining genes (called \bar{p}), respectively. For both sets, the log ratios described
above are fitted to normal distributions by simply taking the mean \bar{r} and the empirical standard deviation s of the sets.
As an example, Equations 2 and 3 show how this can be done for the set of genes p belonging to the path for time point t.
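
Assuming the ordinary sample mean and an empirical standard deviation normalized by |p| - 1 (the normalization is an assumption), the mean and standard deviation of the pathway-model for time point t can be written as:

```latex
\bar{r}_{t,p} = \frac{1}{|p|} \sum_{g \in p} r_{t,g}
\qquad (2)

s_{t,p} = \sqrt{\frac{1}{|p|-1} \sum_{g \in p} \left( r_{t,g} - \bar{r}_{t,p} \right)^{2}}
\qquad (3)
```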



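[0083] Analogously, the mean \bar{r}_{t,\bar{p}} and the standard deviation s_{t,\bar{p}} can be computed for the set of genes \bar{p} not
belonging to the path for each time point t:

\bar{r}_{t,\bar{p}} = \frac{1}{|\bar{p}|} \sum_{g \in \bar{p}} r_{t,g}    (4)

s_{t,\bar{p}} = \sqrt{\frac{1}{|\bar{p}|-1} \sum_{g \in \bar{p}} \left( r_{t,g} - \bar{r}_{t,\bar{p}} \right)^{2}}    (5)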
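
The fitting of the pathway-model and the null-model for one time point can be sketched in Python as follows. The (n - 1) normalization of the standard deviation, the function names `fit_normal` and `fit_models`, and the toy values are assumptions.

```python
import math

def fit_normal(values):
    """Return (mean, empirical standard deviation) of a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

def fit_models(log_ratios, pathway, t):
    """Fit one normal model to the pathway genes and one to all remaining
    genes, using the log ratios at time index `t`."""
    in_path = [r[t] for gene, r in log_ratios.items() if gene in pathway]
    rest = [r[t] for gene, r in log_ratios.items() if gene not in pathway]
    return fit_normal(in_path), fit_normal(rest)

# Hypothetical log ratios for five genes at two time points.
log_ratios = {
    "g1": [1.0, 0.5], "g2": [3.0, 0.7],
    "g3": [-1.0, 0.0], "g4": [0.0, 0.1], "g5": [1.0, -0.1],
}
pathway_model, null_model = fit_models(log_ratios, {"g1", "g2"}, t=0)
```

Each model is just a (mean, standard deviation) pair per time point, so fitting all time points of a series amounts to calling `fit_models` once per index.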



[0084] For each gene g, a score is computed that reflects how well it fits the path. For this purpose, the gene is
removed from the set it is assumed to belong to (either p or \bar{p}), resulting in the sets p-{g} and \bar{p}-{g}. This is most important
when the size of the set is small and the presence of the gene has a considerable effect on the estimated distribution.
First, an estimation P for the probability of the gene g to belong to the path p-{g} or to the set of remaining genes
\bar{p}-{g}, respectively, is approximated using the normal distribution \Phi. This is done for each time point t individually.

P_{t,p}(g \mid p-\{g\}) := 2 \, \Phi\left( -\frac{|r_{t,g} - \bar{r}_{t,p-\{g\}}|}{s_{t,p-\{g\}}} \right)    (6)

P_{t,\bar{p}}(g \mid \bar{p}-\{g\}) := 2 \, \Phi\left( -\frac{|r_{t,g} - \bar{r}_{t,\bar{p}-\{g\}}|}{s_{t,\bar{p}-\{g\}}} \right)    (7)
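
The probability estimate for a single gene and time point can be sketched in Python under the assumption that it is the two-sided tail probability of a normal model fitted to the set with the gene itself removed. The helper names `phi` and `membership_probability` are hypothetical, and the standard deviation uses the (n - 1) normalization as an assumption.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def membership_probability(value, others):
    """Two-sided tail probability 2 * Phi(-|value - mean| / std) under a
    normal model fitted to `others` (the gene set without the gene itself)."""
    n = len(others)
    mean = sum(others) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in others) / (n - 1))
    if std == 0:
        raise ValueError("model is degenerate: all values are identical")
    return 2.0 * phi(-abs(value - mean) / std)

# Hypothetical log ratios of the other genes in the set at one time point.
others = [-1.0, 0.0, 1.0]             # mean 0, empirical standard deviation 1
p_center = membership_probability(0.0, others)
p_far = membership_probability(1.96, others)
```

A gene whose log ratio coincides with the model mean receives probability 1, and the value drops towards 0 as the gene moves away from the rest of the set.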



[0085] This definition of the probabilities rests on the assumption that it is the more probable that a set forms a
pathway that is realized in the investigated cell types, the more correlated (e.g. proportional) the expression of this set 
of genes is. This is especially true for the pathway model, whereas the null-model is reasonably justified by the empirical 
log-ratio distribution observed above. In general, synchronization of expression can be more sophisticated than mere 
proportionality, and accordingly more elaborate models can be devised. Nevertheless, these definitions prove useful, as shown below.

[0086] The score of g is calculated as the log-odds score of the estimations to belong to the path p-{g} or to the set of remaining genes \bar{p}-{g}:

score_{t,p}(g) := \log \frac{P_{t,p}(g \mid p-\{g\})}{P_{t,\bar{p}}(g \mid \bar{p}-\{g\})}    (8)

•"-w^Z^.w ( 9) 
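
The log-odds score per time point and its combination into one gene score can be sketched in Python as below. Combining the time points by an unweighted average is an assumption based on the "average" column described for Table 5; `log_odds`, `gene_score` and the probabilities are hypothetical.

```python
import math

def log_odds(p_path, p_null):
    """Log-odds of the pathway-model probability against the null-model probability."""
    return math.log(p_path / p_null)

def gene_score(p_path_by_time, p_null_by_time):
    """Average the per-time-point log-odds scores of one gene (unweighted)."""
    scores = [log_odds(a, b) for a, b in zip(p_path_by_time, p_null_by_time)]
    return sum(scores) / len(scores)

# Hypothetical probabilities of one gene at two time points.
score = gene_score([0.8, 0.4], [0.2, 0.4])
```

A positive score means the gene fits the pathway-model better than the null-model on average, a negative score the opposite; relative weights per time point could replace the plain average.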
ao D r T d bV 9raPh StrUCtUre thal ^ he IS * ^ Wmw *' fr0m 9 ,uco ^ *» ■ number

[0091, This ^ciZ^l^^^' f ,h6 ^ ™™ as bold In Figure 6. 
number of different pathways on the basis 3fc3^2S2^?° n,) 38 ,ranSiti ° nS ' 9enenca ^ ^ents a 

of Proteins to the pathway J^SS^TSS^ °°^«*9 <° one possible assignment 

fromThe K P ° imS ^ r6SpeCt 10 the base t^i^n^^r 08 ° f eXPr6SSi0n Va ' UeS measured a 
SSt? I T 6 ViSUa,i2ed as time ctJfves 'n Figure 10 expression assessment [10]. The data 

L J in Z IIKSSSL^ as an exampie case. F rom the 

ioo'^^ 

values of the gUes not inc ud e d n 2S^?!?1? nU,| - m ° de,S Ca " be ««P— . Therefore, the expression 
in [1 0, are needed (this expression 5*^^^^^^ ° f tbe k "-n yeast geneslnves^ e °d 
he stat ,st,csof ^ this set .^'^tmrnmn^i^S^^^ * "- e9leCtab,e inf,ue "~ °" 

A«or d ln gly , this patnway „ assjgned he a ~ o,?^" mC ' Uded lhe P-** are revea.ed [shown in Tabie 5, 

br - ss^xcr: r s ° f sets - — — - « * 

[0098] ln contrast, the analogous computet^ oTsc^T V ° d,s,ribu,i °" <* scores shown in Figure 11 
[0099] It is easy to see that the scores of the glycolysis pathway lie well above the scores expected from random
paths. This confirms the hypothesis that the glycolysis pathway is realized in the investigated states of yeast.
[0100] This result cannot be achieved using clustering methods, because the genes encoding the involved enzymes
are not similarly regulated (at least this does not manifest in the current measurements), as was already observed in
[10] and can be seen in Figure 10.

[0101] The method of the present invention, even with the simple example statistical model as described in the 
previous section, can recognize realized pathways with heterogeneous regulation. 

[0102] For the example glycolysis pathway defined above, a p-value of 0.0009 can be derived from the random score
distribution by determining the fraction of random pathways that score equal to or better than the glycolysis pathway
itself. This is a very good result and is, given the data shown in Figure 10, very hard to match with clustering-based methods.
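
The p-value described above can be sketched as an empirical tail fraction. This is a minimal illustration, not the implementation used for the reported result: the function name `empirical_p_value` is hypothetical, it assumes that a higher score is better, and the numbers in the usage lines are synthetic.

```python
from typing import Sequence


def empirical_p_value(observed_score: float, random_scores: Sequence[float]) -> float:
    """Fraction of random pathway scores that are equal to or better than
    the observed pathway score (assumption: higher score = better)."""
    if not random_scores:
        raise ValueError("at least one random pathway score is required")
    as_good_or_better = sum(1 for s in random_scores if s >= observed_score)
    return as_good_or_better / len(random_scores)


# Synthetic illustration only: 10000 made-up random scores, 9 of which
# reach or exceed the made-up observed score.
synthetic_random_scores = list(range(10_000))
p_value = empirical_p_value(9991, synthetic_random_scores)
print(p_value)
```

With a finite sample of random pathways, the smallest non-zero p-value this estimate can resolve is one divided by the number of random pathways scored.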
[0103] Another example illustrates this point: For the textbook tricarboxylic acid (TCA) cycle, a supposedly easier 
example, our method performs even better. Excellent scores as shown in Figure 13 are achieved. Here, even the 
lowest TCA pathway score is better than the highest score of 10000 random pathways with equal length (shown in 
Figure 14), whereas, again, the accompanying expression level time series (Figure 15) do not cluster together easily
in non-trivial discriminating clusterings. 

Experiment Design 

[0104] The above methods for deriving and representing networks, for the generation of pathways with specific
characteristics, and for the subsequent calculation of scores can be applied for improving the design of further
experiments and experimental measurements, by performing the following steps:

• measuring the new data in the expression experiment in order to provide enhanced discrimination between
the various hypotheses to be tested

• designing the experiments based on the hypotheses fed into the system, e.g. formulated in augmented Petri nets

• designing the experiments to account for the type of statistical score used for the subsequent evaluation

• planning the experimental setups such that the already measured data is used to avoid unnecessary experimental
duplication

• placing normalization measurements at crucial points in the experimental setup to allow for optimal usage of
precious material, i.e. patient tissue of certain disease states

• connecting measurements made on readily available in vitro material with measurements on in vivo material for
the evaluation

• designing additional experiments such that ambiguities in the scoring based on the previous experiments alone
are removed and such that the resulting statistical score is optimized

• designing the experiments such that the DNA chips or other experimental equipment are used efficiently, i.e. the
number of consumed resources is minimized for the information obtained.
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10. The method of claims 1 to 9, wherein the pathways are constructed from established biological knowledge and/or 
from hypotheses. 

11. The method of claims 1 to 10, wherein Petri nets are used for the construction of the pathways. 

12. The method of claims 1 to 11, wherein at least two states of one biological system or the states of at least two 
biological systems are compared. 

13. The method of claims 1 to 12 

to identify pathways that are biologically realized in only one, some, or all biological systems under investiga- 
tion. 

to identify pathways that are biologically realized or missing in disease states,

for identifying molecules that do not form part of the complete pathway corresponding to a given pathway, or 
for identifying molecules that form part of the complete pathway corresponding to a given pathway. 

14. A method as defined in claims 1 to 13, taking into account the type of statistical score used for the subsequent 
evaluation according to claims 1 to 13, applied for enhancing and planning the design of experiments by 

• planning the experimental setups such that the already measured data is used to avoid unnecessary experi- 
mental duplication and such that experimental equipment is used efficiently, 

• placing normalization measurements at crucial points in the experimental setup to allow for efficient usage of 
precious material, i.e. patient tissue of certain disease states, by connecting measurements made with readily 
available in vitro material with measurements on in vivo material for the evaluation, or 

• designing additional experiments such that ambiguities in the scoring based on the previous experiments 
alone are resolved and such that the resulting statistical score based also on the additional measurements is 
optimized and discriminating between specific pathways. 

15. An iterative method in particular according to the claims 1 to 14 for intertwining the hypotheses formulation and 
experiment design, comprising the steps of 

• selecting the most plausible pathways according to the current experimental data with the methods of claims
1 to 14 claimed above,

• modifying and enhancing the interesting pathways based on this analysis with new formalized hypotheses, 

• deriving new experimental setups, which discriminate between alternative and/or contradicting hypotheses 

• iterating these steps until enough information on potential target candidates has been assembled to proceed 
to subsequent steps of target validation and drug development or the network cannot be reliably further ex- 
tended in step 2 above. 
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Figure 1: 
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Figure 2: 




max. length = 5 reactions 
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Figure 7: 
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Figure 14: 
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Table 1: 



gene (yeast ORF ID)      t1      t2      t3      t4      t5      t6      t7
YCL040W              -0.644  -0.234   0.807   0.782   1.836   1.945   1.251
YBR196C               0.379   0.782   1.118   0.475   0.454   -0.12  -0.494
YDR050C               0.084  -0.029   0.356   0.202   0.057   0.084   -1.12
YGR240C               0.263   0.536   0.433  -0.184  -0.152  -0.943  -0.786
YKL060C               0.138   0.138    0.07  -0.014   0.043  -0.218  -1.252
YJL052W               0.163   0.251   0.782   0.433   0.454   0.299  -0.621
YCR012W               0.084   0.163   0.757   0.623   0.475   0.124  -0.515
YDL021W              -0.218  -0.074   0.516   0.864   1.029   1.599   1.287
YHR174W               0.084   0.239   0.251   0.057   0.239  -0.434  -1.252
YOR347C              -0.152   0.322   0.536   0.687   0.918   1.395   1.029
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Table 2: 



pathway                  t1      t2      t3      t4      t5      t6      t7
P - {YCL040W}         0.092   0.259   0.535   0.349   0.391   0.198  -0.414
P - {YBR196C}        -0.022   0.146   0.501   0.383   0.544   0.428  -0.220
P - {YDR050C}         0.011   0.236   0.586   0.414   0.588   0.405  -0.150
P - {YGR240C}        -0.009   0.173   0.577   0.457   0.612   0.519  -0.187
P - {YKL060C}         0.005   0.217   0.617   0.438   0.590   0.439  -0.136
P - {YJL052W}         0.002   0.205   0.538   0.388   0.544   0.381  -0.206
P - {YCR012W}         0.011   0.215   0.541   0.367   0.542   0.401  -0.218
P - {YDL021W}         0.044   0.241   0.568   0.340   0.480   0.237  -0.418
P - {YHR174W}         0.011   0.206   0.597   0.430   0.568   0.463  -0.136
P - {YOR347C}         0.037   0.197   0.566   0.360   0.493   0.260  -0.389
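
Recomputing from Table 1 suggests that each row P - {g} of Table 2 holds, per time point, the mean expression value over all pathway genes other than g. This reading is inferred from the numbers rather than taken from the text, and the helper name `leave_one_out_means` is hypothetical. A minimal sketch under that assumption:

```python
from statistics import mean

# Expression values from Table 1, one list per gene for t1..t7.
TABLE_1 = {
    "YCL040W": [-0.644, -0.234, 0.807, 0.782, 1.836, 1.945, 1.251],
    "YBR196C": [0.379, 0.782, 1.118, 0.475, 0.454, -0.12, -0.494],
    "YDR050C": [0.084, -0.029, 0.356, 0.202, 0.057, 0.084, -1.12],
    "YGR240C": [0.263, 0.536, 0.433, -0.184, -0.152, -0.943, -0.786],
    "YKL060C": [0.138, 0.138, 0.07, -0.014, 0.043, -0.218, -1.252],
    "YJL052W": [0.163, 0.251, 0.782, 0.433, 0.454, 0.299, -0.621],
    "YCR012W": [0.084, 0.163, 0.757, 0.623, 0.475, 0.124, -0.515],
    "YDL021W": [-0.218, -0.074, 0.516, 0.864, 1.029, 1.599, 1.287],
    "YHR174W": [0.084, 0.239, 0.251, 0.057, 0.239, -0.434, -1.252],
    "YOR347C": [-0.152, 0.322, 0.536, 0.687, 0.918, 1.395, 1.029],
}


def leave_one_out_means(table):
    """For each gene g, average the remaining genes at every time point."""
    n_points = len(next(iter(table.values())))
    result = {}
    for left_out in table:
        others = [values for gene, values in table.items() if gene != left_out]
        result[left_out] = [mean(row[t] for row in others) for t in range(n_points)]
    return result


loo_means = leave_one_out_means(TABLE_1)
for gene, values in loo_means.items():
    print(f"P - {{{gene}}}", [round(v, 3) for v in values])
```

Rounded to three decimals, the printed rows can be compared entry by entry with Table 2.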






EP 1 158 447 A1 



Table 3: 



pathway                  t1      t2      t3      t4      t5      t6      t7
P - {YCL040W}         0.185   0.267   0.314   0.354   0.395   0.822   0.941
P - {YBR196C}         0.278   0.231   0.253   0.381   0.625   0.992   1.090
P - {YDR050C}         0.308   0.302   0.318   0.376   0.600   1.003   1.044
P - {YGR240C}         0.295   0.290   0.324   0.316   0.571   0.882   1.075
P - {YKL060C}         0.305   0.313   0.271   0.351   0.598   0.985   1.028
P - {YJL052W}         0.304   0.314   0.317   0.382   0.625   1.009   1.085
P - {YCR012W}         0.308   0.314   0.319   0.372   0.625   1.005   1.089
P - {YDL021W}         0.296   0.296   0.327   0.339   0.598   0.900   0.932
P - {YHR174W}         0.308   0.314   0.306   0.361   0.616   0.963   1.028
P - {YOR347C}         0.302   0.312   0.327   0.366   0.609   0.934   0.985
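
The t1 column of Table 3 agrees with a leave-one-out sample standard deviation (divisor n - 1) computed from the t1 column of Table 1. As with the means, this is an observation about the numbers, not a definition quoted from the text; the later time points were not rechecked here, and the name `leave_one_out_stdev` is hypothetical. A minimal sketch for the t1 column only:

```python
from statistics import stdev

# t1 column of Table 1.
T1 = {
    "YCL040W": -0.644, "YBR196C": 0.379, "YDR050C": 0.084, "YGR240C": 0.263,
    "YKL060C": 0.138, "YJL052W": 0.163, "YCR012W": 0.084, "YDL021W": -0.218,
    "YHR174W": 0.084, "YOR347C": -0.152,
}


def leave_one_out_stdev(values):
    """Sample standard deviation over all genes except the one left out."""
    return {
        left_out: stdev([v for gene, v in values.items() if gene != left_out])
        for left_out in values
    }


loo_sd_t1 = leave_one_out_stdev(T1)
for gene, sd in loo_sd_t1.items():
    print(f"P - {{{gene}}}", round(sd, 3))
```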
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Table 4: 





                         t1      t2      t3      t4      t5      t6      t7
mean_r               -0.089  -0.035   0.116   0.197   0.242   0.089   0.202
[label illegible]     0.251   0.283   0.317   0.342   0.388   0.821   0.886






Table 5: 



gene                     t1      t2      t3      t4      t5      t6      t7 average
YCL040W              -5.929  -1.998   2.584   3.964   7.908   0.346  -0.274   0.943
YBR196C               0.870   0.401   2.250   2.797   2.496  -0.319   0.078   1.225
YDR050C               0.503  -0.951   0.046   0.857  -0.161  -0.284   0.164   0.025
YGR240C               0.794   1.570   0.726  -3.122  -1.507  -0.763   0.126  -0.311
YKL060C               0.593   0.391  -3.017  -1.095  -0.250  -0.339   0.163  -0.508
YJL052W               0.636   1.039   2.518   2.629   2.496   0.158   0.099   1.368
YCR012W               0.503   0.585   2.447   3.396   2.649  -0.210   0.081   1.350
YDL021W              -0.481  -1.131   1.440   4.160   5.825   0.682  -0.318   1.454
YHR174W               0.503   1.012  -0.957  -0.415   1.013  -0.398   0.163   0.132
YOR347C              -0.411   1.199   1.611   3.642   5.153   0.699  -0.093   1.686
average              -0.242   0.212   0.965   1.681   2.562  -0.043   0.019   0.736
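
The "average" column and the "average" row of Table 5 can be reproduced as plain arithmetic means over time points and over genes, respectively, with the bottom-right entry being the mean over all entries. The sketch below only checks this bookkeeping and does not show how the individual entries are derived. Two OCR-damaged entries are read as 0.857 and 2.518, which the printed averages support. The variable names are hypothetical.

```python
from statistics import mean

# Per-gene entries of Table 5 for t1..t7 (without the "average" column).
TABLE_5 = {
    "YCL040W": [-5.929, -1.998, 2.584, 3.964, 7.908, 0.346, -0.274],
    "YBR196C": [0.870, 0.401, 2.250, 2.797, 2.496, -0.319, 0.078],
    "YDR050C": [0.503, -0.951, 0.046, 0.857, -0.161, -0.284, 0.164],  # 0.857: OCR-damaged entry
    "YGR240C": [0.794, 1.570, 0.726, -3.122, -1.507, -0.763, 0.126],
    "YKL060C": [0.593, 0.391, -3.017, -1.095, -0.250, -0.339, 0.163],
    "YJL052W": [0.636, 1.039, 2.518, 2.629, 2.496, 0.158, 0.099],  # 2.518: OCR-damaged entry
    "YCR012W": [0.503, 0.585, 2.447, 3.396, 2.649, -0.210, 0.081],
    "YDL021W": [-0.481, -1.131, 1.440, 4.160, 5.825, 0.682, -0.318],
    "YHR174W": [0.503, 1.012, -0.957, -0.415, 1.013, -0.398, 0.163],
    "YOR347C": [-0.411, 1.199, 1.611, 3.642, 5.153, 0.699, -0.093],
}

# Mean over time points for each gene (the "average" column).
row_averages = {gene: mean(values) for gene, values in TABLE_5.items()}

# Mean over genes for each time point (the "average" row).
column_averages = [mean(column) for column in zip(*TABLE_5.values())]

# Mean over all entries (the bottom-right cell).
overall_average = mean(v for values in TABLE_5.values() for v in values)

print({gene: round(avg, 3) for gene, avg in row_averages.items()})
print([round(avg, 3) for avg in column_averages])
print(round(overall_average, 3))
```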
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